Re-write partitioner to use ColumnChunks instead of ValueVectors #2979

benjaminwinger · 2024-02-29T23:48:20Z

ValueVectors have high memory fragmentation, and allocate strings in 256KB chunks for only 2048 strings.
ColumnChunks can have a much larger capacity, and also support string de-duplication.
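
For readers unfamiliar with string de-duplication, here is a minimal sketch of the general idea: each distinct string is stored once in a dictionary, and every row holds only a small index into it. `DedupStringChunk` and its members are hypothetical names for illustration, not the project's actual string column chunk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified dictionary-encoded string chunk.
// It only illustrates the de-duplication idea, not the real implementation.
class DedupStringChunk {
public:
    // Appends a string and returns the row position it was stored at.
    std::size_t append(const std::string& value) {
        std::uint32_t id;
        auto it = lookup.find(value);
        if (it == lookup.end()) {
            // First occurrence: store the string once in the dictionary.
            id = static_cast<std::uint32_t>(dictionary.size());
            dictionary.push_back(value);
            lookup.emplace(value, id);
        } else {
            // Repeat occurrence: reuse the existing dictionary entry.
            id = it->second;
        }
        indices.push_back(id);
        return indices.size() - 1;
    }

    // Returns the string stored at the given row.
    const std::string& get(std::size_t row) const {
        assert(row < indices.size());
        return dictionary[indices[row]];
    }

    std::size_t numRows() const { return indices.size(); }
    std::size_t numDistinct() const { return dictionary.size(); }

private:
    std::vector<std::string> dictionary;                    // one copy per distinct string
    std::unordered_map<std::string, std::uint32_t> lookup;  // string -> dictionary index
    std::vector<std::uint32_t> indices;                     // one index per row
};
```

In this sketch, appending "a", "b", "a" yields three rows but only two dictionary entries. When nearly every string is distinct, the per-row index and the lookup table are pure overhead, so a scheme like this can use more memory than storing the strings directly.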

Fixes #2863: peak memory usage when copying the dataset was 13GB (36GB before I enabled string de-duplication in the partitioner's column chunks).
Fixes #2957

I'm not sure if we should leave string compression enabled in the Partitioner, as it can increase memory usage for datasets with little duplication. At the least it should probably follow the global compression setting (which I don't think is easily accessible to the Partitioner).

benjaminwinger · 2024-03-04T20:01:03Z

I've done some optimizations, and memory usage is now ~16GB without compression on that dataset. Compression on the chunks used in the partitioner has been disabled for now. It may be worthwhile to look into compressing ColumnChunks in memory more broadly when we're accumulating intermediate values, but doing that here would just make this change more complicated.

ColumnChunk capacity was left at 2048 (same as the previous implementation using DataChunkCollection/ValueVector). Each thread needs to allocate a chunk for each node group, for each property, in both directions, which adds up quickly. If the capacity is too high, initializing the per-thread state can allocate much more memory than is necessary to store the data.
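
To make the "adds up quickly" point concrete, the following is a rough back-of-the-envelope sketch of the memory reserved up front. The function name `upfrontBytes` and every input value (8 threads, 100 node groups, 4 properties, 8 bytes per value) are illustrative assumptions, not measurements from this PR, and the formula ignores null masks and other per-chunk overhead.

```cpp
#include <cstdint>

// Rough estimate of the bytes reserved when every thread pre-allocates one
// chunk per node group, per property, in both directions.
// Hypothetical helper for illustration only.
constexpr std::uint64_t upfrontBytes(std::uint64_t numThreads, std::uint64_t numNodeGroups,
    std::uint64_t numProperties, std::uint64_t chunkCapacity, std::uint64_t bytesPerValue) {
    constexpr std::uint64_t numDirections = 2; // forward and backward
    return numThreads * numNodeGroups * numProperties * numDirections * chunkCapacity *
           bytesPerValue;
}

// Assumed inputs: 8 threads, 100 node groups, 4 properties, 8-byte values.
// With a capacity of 2048 values per chunk this is 100 MiB.
static_assert(upfrontBytes(8, 100, 4, 2048, 8) == 104857600ULL, "100 MiB");
// Raising the capacity 32x to 65536 scales the reservation to 3.125 GiB.
static_assert(upfrontBytes(8, 100, 4, 65536, 8) == 3355443200ULL, "3.125 GiB");
```

The reservation grows linearly with the chunk capacity regardless of how much data each partition ends up holding, which is the trade-off described above.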

codecov · 2024-03-04T20:17:08Z

Codecov Report

Attention: Patch coverage is 89.37500%, with 17 lines in your changes missing coverage. Please review.

Project coverage is 93.25%. Comparing base (5e598ec) to head (38e4398).

Files	Patch %	Lines
src/storage/store/column_chunk.cpp	67.44%	14 Missing ⚠️
src/storage/store/var_list_column_chunk.cpp	91.30%	2 Missing ⚠️
src/include/storage/store/node_group.h	50.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2979      +/-   ##
==========================================
- Coverage   93.31%   93.25%   -0.06%     
==========================================
  Files        1124     1124              
  Lines       42912    42934      +22     
==========================================
- Hits        40042    40040       -2     
- Misses       2870     2894      +24

src/include/processor/operator/partitioner.h

src/processor/operator/partitioner.cpp

ray6080 · 2024-03-05T13:39:34Z

src/include/storage/store/column_chunk.h

@@ -151,12 +151,12 @@ class BoolColumnChunk : public ColumnChunk {
              enableCompression, hasNullChunk) {}

    void append(common::ValueVector* vector) final;
+    void appendOne(common::ValueVector* vector, common::vector_idx_t pos) final;


This is fine temporarily, but we should refactor the two interfaces, append(common::ValueVector* vector) and appendOne(common::ValueVector* vector, common::vector_idx_t pos), into a single one that takes a SelVector as an argument:

void append(common::ValueVector* vector, SelVector& sel);
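
As a rough illustration of that suggestion, here is a minimal sketch in which one `append` covers both the whole-vector and the single-position case. `ValueVector`, `SelVector`, and `Int64ColumnChunk` below are simplified, hypothetical stand-ins rather than the project's real classes, the helpers `selectAll` and `selectOne` are invented for the example, and the selection vector is taken by const reference, which differs from the signature proposed above.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Simplified, hypothetical stand-ins for the real types.
struct ValueVector {
    std::vector<std::int64_t> values;
};
struct SelVector {
    std::vector<std::uint32_t> selectedPositions;
};

class Int64ColumnChunk {
public:
    // Single entry point: append only the values at the selected positions.
    void append(const ValueVector& vector, const SelVector& sel) {
        for (auto pos : sel.selectedPositions) {
            assert(pos < vector.values.size());
            data.push_back(vector.values[pos]);
        }
    }

    const std::vector<std::int64_t>& values() const { return data; }

private:
    std::vector<std::int64_t> data;
};

// Selects every position; replaces the old append(vector).
inline SelVector selectAll(const ValueVector& vector) {
    SelVector sel;
    sel.selectedPositions.resize(vector.values.size());
    std::iota(sel.selectedPositions.begin(), sel.selectedPositions.end(), std::uint32_t{0});
    return sel;
}

// Selects a single position; replaces the old appendOne(vector, pos).
inline SelVector selectOne(std::uint32_t pos) {
    return SelVector{{pos}};
}
```

With this shape, `chunk.append(v, selectAll(v))` plays the role of `append(v)` and `chunk.append(v, selectOne(pos))` plays the role of `appendOne(v, pos)`, so subclasses would only need to override one method.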

benjaminwinger force-pushed the rel-memory-fix branch from a5b30d6 to be2b2fe Compare March 4, 2024 19:59

benjaminwinger force-pushed the rel-memory-fix branch from be2b2fe to 6151053 Compare March 4, 2024 20:45

benjaminwinger mentioned this pull request Mar 4, 2024

More efficient ColumnChunk string dictionary caching #2994

Merged

ray6080 approved these changes Mar 5, 2024

benjaminwinger force-pushed the rel-memory-fix branch 3 times, most recently from 1a79a66 to 2042c47 Compare March 8, 2024 15:57

Re-write partitioner to use ColumnChunks instead of ValueVectors

38e4398

benjaminwinger force-pushed the rel-memory-fix branch from 2042c47 to 38e4398 Compare March 8, 2024 16:03

benjaminwinger merged commit b7e3bc7 into master Mar 8, 2024
15 checks passed

benjaminwinger deleted the rel-memory-fix branch March 8, 2024 20:56
